The Medical Expenditure Panel Survey (MEPS) is a medical data set. It contains 44 variables describing each patient's demographics (age, race, gender), socio-economic status (education, income, insurance), medical information (diagnosed diseases, symptoms, self-reported health status), and health expenses. The goal is to predict a patient's healthcare cost from this description.
Investigating the variables shows 8 numerical and 35 categorical independent variables. None of them is strongly correlated with the dependent variable, 'HEALTHEXP', which describes the patient's health expenses. The boxplot of 'HEALTHEXP' has a long tail, so the variable was transformed with a base-3 logarithm (which is easier to interpret than the natural logarithm).
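The base-3 transform is just a change of base. A minimal sketch with made-up expense values, mirroring the convention used in the preprocessing code where zero expenses stay zero:

```python
import numpy as np

# Made-up expense values for illustration; 2187 = 3 ** 7.
expenses = np.array([0.0, 1.0, 9.0, 2187.0])

# Base-3 logarithm via change of base: log3(v) = ln(v) / ln(3).
# Zero expenses are mapped to 0 to keep the transform defined.
safe = np.where(expenses == 0, 1.0, expenses)
log3 = np.where(expenses == 0, 0.0, np.log(safe) / np.log(3))

# The inverse transform (3 ** x) recovers the original scale for
# nonzero values; the 0 -> 0 convention is lossy, since 3 ** 0 == 1.
recovered = 3.0 ** log3
```

This also makes the report's "3 to the power of x" statements directly comparable to dollar amounts.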
Two models were trained for this exercise: an XGB model and a linear model.
The XGB model was trained with a pipeline that one-hot encodes the categorical variables and standard-scales the numerical ones. Results on the test set:
A linear model with lasso regularization was also fitted. Lasso provides a means of variable selection. Final model results on the test set (the same split as for the XGB model):
The results were worse than for the XGB model, but the model was more selective with its variables.
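Lasso's selectivity can be illustrated on synthetic data: with a sufficiently large penalty, coefficients of irrelevant features are driven exactly to zero. A small sketch with made-up data, using the same `alpha=0.5` as in the report:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(42)
X = rng.normal(size=(500, 6))
# Only features 0 and 1 drive the response; the other four are noise.
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=500)

# alpha=0.5 is strong enough here to zero out the irrelevant coefficients
# while keeping the two informative ones (shrunk toward zero).
model = Lasso(alpha=0.5).fit(X, y)
selected = np.flatnonzero(model.coef_)
```

The surviving nonzero coefficients are exactly the informative features, which is the "variable selection" behavior referred to above.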
Patient 3639 is a 33-year-old man. He has never been married and holds a bachelor's degree. He suffers from joint pain and asthma and has no other positive diagnosis. He has a high income and private insurance.
The XGB model predicted the cost to be 3 to the power of 6.18, while the true value was 3 to the power of 6.51.
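Since the model predicts on the log3 scale, the gap between prediction and truth translates into a multiplicative factor on the original expense scale. A quick back-of-the-envelope check:

```python
# The model predicts log3(expense), so outputs convert back with 3 ** x.
predicted = 3.0 ** 6.18   # about 888 on the original expense scale
actual = 3.0 ** 6.51      # about 1277

# A gap of 0.33 on the log3 scale is a multiplicative factor of 3 ** 0.33,
# i.e. the true cost is roughly 1.44 times the predicted one.
ratio = actual / predicted
```

The same arithmetic applies to the ceteris paribus effects quoted later (e.g. "3 to the power of 0.3 times less" means dividing the expense by 3 ** 0.3).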
The ceteris paribus plot for 'POVCAT15' indicates that if he were poorer he would pay less for his healthcare (about 3 to the power of 0.5, i.e. √3, times less). His age ('AGE31X') puts him in the middle of the prediction range: if he were younger his costs would be lower, and if he were older he would pay more.
If he did not have asthma he would pay about 3 to the power of 0.3 times less. He would also pay less without joint pain ('JTPAIN31'). With a lower education record he would spend less on his healthcare (CP plot for the categorical variable 'EDRECODE'). With no insurance his costs would be about 9 times lower, while with public insurance they would stay the same. The model lowered the prediction by about 3 to the power of 1.2 times due to the patient's gender.
Patient 975 is a young (22-year-old) man. He has never been married and graduated from high school. He is poor and in good health, with no diagnosed diseases or conditions.
The XGB model's prediction was close to the true value, which is 0.
The ceteris paribus profile for his age ('AGE31X') barely changes the prediction (up to about 3 to the power of 0.2 times). A higher income ('INCOME_M') would increase the predicted cost significantly, and so would any change in his poverty status ('POVCAT15'). A positive diagnosis for any disease would likewise increase his medical expenses. Interestingly, if the patient quit smoking he would spend about 3 to the power of 0.2 times more on health expenses ('ADSMOK42'). This could mean the money he saves on cigarettes would be spent on his health instead.
On the other hand, there is patient 896. He is 54 years old and married. He is also poor but has a higher income than patient 975. He tested positive for high blood pressure, high cholesterol, diabetes, joint pain, arthritis, asthma, walking limitations, and cognitive limitations. He does not smoke and has public insurance.
The XGB model predicted that the patient would spend 3 to the power of 9 on health expenses, while the true value was 3 to the power of 8.82.
The CP profile for his age shows that if he were 20 years old or younger his costs would be about 3 times lower; they would also be lower if he were older. The age effect is more pronounced than for patient 975. A higher income would lower his prediction, the opposite of the behavior in patient 975's CP profile. Similarly, 'POVCAT15' affects patient 896 differently: the prediction first decreases and then, after passing the value 3, increases. If patient 896 had not been diagnosed with high cholesterol ('CHOLDX'), his predicted cost would be about 3 to the power of 0.48 times lower. A similar effect is observed for the variables describing his positive tests for diabetes ('DIABDX') and cognitive limitations ('COGLIM31').
As with patient 975, patient 896 would spend less on his healthcare if he smoked.
Patient 704 is a 59-year-old woman. She has never been married. She has public insurance and a low income. She tested positive for high blood pressure, high cholesterol, heart disease, joint pain, and arthritis. She has walking, cognitive, and social limitations. She does not smoke.
Her health expenses (rounded to one decimal place):
The XGB model's prediction was about 0.2 higher and the linear model's about 0.3 lower (on the log3 scale).
Explanations for some of the differing CP profile variables:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
import xgboost as xgb
from sklearn.linear_model import Lasso
import dalex as dx
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
# Loading data and data processing
df = pd.read_csv("MEPS_data_preprocessed.csv")
df.drop(columns = ['PANEL', 'PERSONWT'], inplace=True)
df.head(5)
# Boxplots of variables
fig, axes = plt.subplots(11, 4, figsize=(12, 48))  # 44 axes for 44 variables
i = 0
for row in axes:
    for axis in row:
        df.boxplot(column=df.columns[i], ax=axis)
        i += 1
# Correlation matrix
f = plt.figure(figsize=(25, 25))
plt.matshow(df.corr(), fignum=f.number)
plt.xticks(range(df.shape[1]), df.columns, fontsize=14, rotation=45)
plt.yticks(range(df.shape[1]), df.columns, fontsize=14)
cb = plt.colorbar()
cb.ax.tick_params(labelsize=14)
num_inx = df.nunique()[df.nunique() > 10 ].index.tolist()
inx_to_minmax_scale = ['RTHLTH31', 'MNHLTH31', 'POVCAT15']
num_inx.extend(inx_to_minmax_scale)
num_inx.remove('HEALTHEXP')
print(f"Numerical variables ({len(num_inx)}): {num_inx}")
cat_inx = df.nunique()[df.nunique() <= 10 ].index.tolist()
cat_inx = [x for x in cat_inx if x not in num_inx]
for cat in cat_inx:
    df[cat] = df[cat].astype('str')
print(f"\nCategorical variables ({len(cat_inx)}): {cat_inx}")
# Y variable
df.boxplot(column="HEALTHEXP", return_type='axes')
plt.title( "Boxplot of explained variable" )
plt.suptitle('')
plt.show()
val_test = df['HEALTHEXP'].values
df['HEALTHEXP'] = np.array([0 if v == 0 else np.log(v) / np.log(3) for v in val_test])
df.boxplot(column="HEALTHEXP", return_type='axes')
plt.title( "Boxplot of explained variable after log" )
plt.suptitle('')
plt.show()
# Dividing data
x_train, x_test, y_train, y_test = train_test_split(
    df.drop(["HEALTHEXP"], axis=1), df["HEALTHEXP"], test_size=0.2, random_state=123)
def get_model_results(model_name: str, pred_train, y_train, pred_test, y_test, print_res):
    rmse_train = np.sqrt(mean_squared_error(y_train, pred_train))
    r_2_train = r2_score(y_train, pred_train)
    rmse_test = np.sqrt(mean_squared_error(y_test, pred_test))
    r_2_test = r2_score(y_test, pred_test)
    mae_train = mean_absolute_error(y_train, pred_train)
    mae_test = mean_absolute_error(y_test, pred_test)
    if print_res:
        print(f"{model_name} results:")
        print(f"training rmse: {rmse_train}\ntraining r2: {r_2_train}\ntraining mae: {mae_train}")
        print(f"test rmse: {rmse_test}\ntest r2: {r_2_test}\ntest mae: {mae_test}\n")
    return [rmse_train, rmse_test, r_2_train, r_2_test, mae_train, mae_test]
numerical_transformer = Pipeline(
    steps=[
        ('scaler', StandardScaler())
    ]
)
categorical_transformer = Pipeline(
    steps=[
        ('onehot', OneHotEncoder(handle_unknown='ignore'))
    ]
)
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, num_inx),
        ('cat', categorical_transformer, cat_inx)
    ]
)
# XGB model
def xgb_train_and_predict(depth, n_est):
    # One-hot encoding changes the feature set, so the original column
    # names are not passed to the regressor.
    xgb_reg = xgb.XGBRegressor(objective='reg:squarederror',
                               colsample_bytree=0.4,
                               learning_rate=0.2,
                               max_depth=depth,
                               alpha=15,
                               n_estimators=n_est)
    xgb_pipe = Pipeline(steps=[('preprocessor', preprocessor),
                               ('model', xgb_reg)])
    xgb_pipe.fit(x_train, y_train)
    predictions_train = xgb_pipe.predict(x_train)
    predictions_test = xgb_pipe.predict(x_test)
    return xgb_pipe, predictions_train, predictions_test
# XGB with final parameters
d = 6
n = 60
xg_reg, predictions_train, predictions_test = xgb_train_and_predict(d, n)
xgb_res = get_model_results("XGB", predictions_train, y_train, predictions_test, y_test, True)
# Linear Model with lasso
# Lasso cannot handle the string-typed categorical columns directly,
# so it reuses the same preprocessing pipeline as the XGB model.
l_reg = Pipeline(steps=[('preprocessor', preprocessor),
                        ('model', Lasso(alpha=0.5, max_iter=1_000_000))])
l_reg.fit(x_train, y_train)
pred_lr_test = l_reg.predict(x_test)
pred_lr_train = l_reg.predict(x_train)
lasso_res = get_model_results("Lasso Regression", pred_lr_train, y_train, pred_lr_test, y_test, True)
exp_xgb = dx.Explainer(xg_reg, x_train, y_train, label="MEPS XGB")
def explain_with_cp(exp: dx.Explainer, index):
    return exp.predict_profile(x_test.iloc[[index]])
k = 3639
print("Patient no:", k, ", prediction value:", predictions_test[k], ", true value:", y_test.iloc[[k]].values, "\n\n")
cp_res = explain_with_cp(exp_xgb, k)
cp_res.plot(title=f"Patient {k}")
cp_res.plot(title=f"Patient {k}", variable_type = "categorical")
k = 975
print("Patient no:", k, ", prediction value:", predictions_test[k], ", true value:", y_test.iloc[[k]].values, "\n\n")
cp_res = explain_with_cp(exp_xgb, k)
cp_res.plot(title=f"Patient {k}")
cp_res.plot(title=f"Patient {k}", variable_type = "categorical")
k = 896
print("Patient no:", k, ", prediction value:", predictions_test[k], ", true value:", y_test.iloc[[k]].values, "\n\n")
cp_res = explain_with_cp(exp_xgb, k)
cp_res.plot(title=f"Patient {k}")
cp_res.plot(title=f"Patient {k}", variable_type = "categorical")
print("Creating explainer for linear model")
exp_lasso = dx.Explainer(l_reg, x_train, y_train, label="MEPS Lasso")
k = 704
print("Patient no:", k, ", prediction value:", predictions_test[k], ", true value:", y_test.iloc[[k]].values, "\n\n")
cp_res = explain_with_cp(exp_xgb, k)
cp_res.plot(title=f"Patient {k}")
cp_res.plot(title=f"Patient {k}, XGB", variable_type = "categorical")
print("Patient no:", k, ", prediction value:", pred_lr_test[k], ", true value:", y_test.iloc[[k]].values, "\n\n")
cp_res = explain_with_cp(exp_lasso, k)
cp_res.plot(title=f"Patient {k}")
cp_res.plot(title=f"Patient {k}, Linear", variable_type = "categorical")